Author: Anjum Sayed. My prediction approach is as follows:
There are many ways to improve on my method. See the future work section at the end for ideas.
First, we will examine the data set we will use to train the classifier. The training data is contained in the file facies_vectors.csv. The dataset consists of five wireline log measurements, two indicator variables and a facies label at half-foot intervals. In machine learning terminology, each set of log measurements is a feature vector that maps a set of 'features' (the log measurements) to a class (the facies type). We will use the pandas library to load the data into a dataframe, which provides a convenient data structure to work with well log data.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
filename = 'facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[1]:
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:
The nine discrete facies (classes of rocks) are:
These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [2]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[2]:
In [3]:
# Drop the rows with missing PEF values
training_data.dropna(inplace=True)
In [4]:
training_data.describe()
Out[4]:
This is a quick view of the statistical distribution of the input variables. Looking at the count values, there are 3232 feature vectors in the training set.
These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.
Before we plot the well data, let's define a color map so the facies are represented by consistent colors in all the plots in this tutorial. We also create the abbreviated facies labels, and add those to the training_data dataframe.
In [5]:
# 1=sandstone 2=c_siltstone 3=f_siltstone
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[ row['Facies'] -1]
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for facies labels. The plots are based on those described in Alessandro Amato del Monte's excellent tutorial.
In [6]:
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')

    ztop=logs.Depth.min(); zbot=logs.Depth.max()

    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)

    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)

    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 'SiSh', ' MS ', ' WS ', ' D ', ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')

    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)

    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')

    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
Placing the log plotting code in a function makes it easy to plot the logs from multiple wells, and the function can be reused later to view the results when we apply the facies classification model to other wells. The function takes the log data for one well and a list of facies colors as parameters.
We then show the log plot for well SHRIMPLIN.
In [7]:
make_facies_log_plot(training_data[training_data['Well Name'] == 'SHRIMPLIN'], facies_colors)
In addition to individual wells, we can look at how the various facies are represented by the entire training set. Let's plot a histogram of the number of training examples for each facies class.
In [8]:
#count the number of unique entries for each facies, sort them by
#facies number (instead of by number of entries)
facies_counts = training_data['Facies'].value_counts().sort_index()
#use facies labels to index each count
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors, title='Distribution of Training Data by Facies')
facies_counts
Out[8]:
This shows the distribution of examples by facies for the examples in the training set. Dolomite (facies 7) has the fewest with 81 examples. Depending on the performance of the classifier we are going to train, we may consider getting more examples of these facies.
Crossplots are a familiar tool in the geosciences to visualize how two properties vary with rock type. This dataset contains 5 log variables, and a scatter matrix can help to quickly visualize the variation between all the variables in the dataset. We can employ the very useful Seaborn library to quickly create a nice looking scatter matrix. Each pane in the plot shows the relationship between two of the variables on the x and y axis, with each point colored according to its facies. The same colormap is used to represent the 9 facies.
In [9]:
#save plot display settings to change back to when done plotting with seaborn
inline_rc = dict(mpl.rcParams)
import seaborn as sns
sns.set()
sns.pairplot(training_data.drop(['Well Name','Facies','Formation','Depth','NM_M','RELPOS'],axis=1),
hue='FaciesLabels', palette=facies_color_map,
hue_order=list(reversed(facies_labels)))
#switch back to default matplotlib plot style
mpl.rcParams.update(inline_rc)
The supplied training data has a sampling rate of 0.5m, which is lower than the industry standard of 0.1524m. This means the number of observations is a little on the small side, so many ML classifiers are likely to perform poorly, especially with high entropy datasets.
One workaround to this will be to increase the sampling rate to 0.1m, by using a cubic spline to fill in the gaps in the data. Making up data is generally a no-no, but since wireline logs are generally heavily smoothed by the vendors, this additional step shouldn't add too much error, but will give us 5x more data to play with. We'll do this for each individual well (rather than the whole dataset) to prevent interpolation between wells.
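To make the index-stretching idea concrete, here is a minimal sketch on a toy series rather than the real logs. The values and the names `coarse` and `fine` are made up for illustration, and it assumes SciPy is installed, since pandas hands `method="cubic"` over to it.

```python
import numpy as np
import pandas as pd

# Four made-up readings standing in for one log curve from one well.
coarse = pd.Series([10.0, 12.0, 11.0, 15.0])

# Spread the original samples out so four empty slots sit between each pair.
coarse.index = np.arange(0, 5 * len(coarse), 5)            # 0, 5, 10, 15
fine = coarse.reindex(np.arange(0, coarse.index[-1] + 1))  # 0..15, gaps are NaN

# Fill the gaps with a cubic interpolant (pandas calls into SciPy here).
fine = fine.interpolate(method="cubic")

print(len(coarse), "->", len(fine))
```

The original samples are kept unchanged at positions 0, 5, 10 and 15; only the slots in between are filled in.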
In [10]:
upsampled_data = pd.DataFrame()
for well in training_data['Well Name'].unique():
    df = training_data[training_data['Well Name'] == well]
    # Spread the samples out so there are 4 empty rows between each pair
    df.index = np.arange(0, 5*len(df), 5)
    upsampled_df = pd.DataFrame(index=np.arange(0, 5*len(df)))
    upsampled_df = upsampled_df.join(df)
    # Cubic interpolation for numeric columns, forward fill for the rest
    upsampled_df.interpolate(method='cubic', limit=4, inplace=True)
    upsampled_df.ffill(limit=4, inplace=True)
    upsampled_df.drop_duplicates(inplace=True)

    if len(upsampled_data) == 0:
        upsampled_data = upsampled_df
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        upsampled_data = pd.concat([upsampled_data, upsampled_df], ignore_index=True)

upsampled_data["Facies"] = upsampled_data["Facies"].round()
upsampled_data["Facies"] = upsampled_data["Facies"].astype(int)
upsampled_data["NM_M"] = upsampled_data["NM_M"].round()
upsampled_data["NM_M"] = upsampled_data["NM_M"].astype(int)
# Sometimes a small number of the facies are labelled as 0 or 10 - these need to be removed
upsampled_data = upsampled_data[upsampled_data.Facies != 0]
upsampled_data = upsampled_data[upsampled_data.Facies != 10]
upsampled_data.loc[:,'FaciesLabels'] = upsampled_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
In [11]:
upsampled_data
Out[11]:
Let's check whether the facies distributions still look right.
In [12]:
upsampled_data.describe()
Out[12]:
In [13]:
facies_counts = upsampled_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors, title='Distribution of Training Data by Facies')
facies_counts
Out[13]:
Looks good! We'll now use this upsampled data as our training data.
In [14]:
training_data = upsampled_data
In [15]:
correct_facies_labels = training_data['Facies'].values
well_names = training_data['Well Name']
feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth','Facies','FaciesLabels'], axis=1)
feature_vectors.describe()
Out[15]:
Scikit-learn includes a preprocessing module that can 'standardize' the data (giving each variable zero mean and unit variance, sometimes loosely called whitening). Many machine learning algorithms assume features will be standard normally distributed data (i.e. Gaussian with zero mean and unit variance). The factors used to standardize the training set must be applied to any subsequent feature set that will be input to the classifier. The StandardScaler class can be fit to the training set, and later used to standardize any subsequent data.
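As a rough illustration of what fitting and then reusing a scaler means, the sketch below does the same arithmetic by hand with NumPy on made-up numbers. The arrays `train` and `new` are hypothetical stand-ins, not the well data.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # made-up training features
new = rng.normal(loc=50.0, scale=10.0, size=(20, 3))     # made-up later data

# "Fit": learn the per-column mean and standard deviation from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
new_scaled = (new - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The later data is scaled with the training statistics, so its own mean and variance will generally not be exactly 0 and 1.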
In [16]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)
In [17]:
feature_vectors
Out[17]:
Scikit-learn also includes a handy function to randomly split the training data into training and test sets. The test set contains a small subset of feature vectors that are not used to train the classifier. Because we know the true facies labels for these examples, we can compare the results of the classifier to the actual facies and determine the accuracy of the model. Let's use 10% of the data for the test set.
In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features, correct_facies_labels, test_size=0.1, random_state=48)
Now we use the cleaned and conditioned training set to create a facies classifier. As mentioned above, we will use a type of machine learning model known as a support vector machine. The SVM is a map of the feature vectors as points in a multi dimensional space, mapped so that examples from different facies are divided by a clear gap that is as wide as possible.
The SVM implementation in scikit-learn takes a number of important parameters. First we create a classifier using the default settings.
In [19]:
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y_train)
predicted_labels = clf.predict(X_test)
The cell above trains the classifier on the training set we created earlier and then uses it to predict the facies of the feature vectors in the test set. Because we know the true facies labels of the vectors in the test set, we can use the results to evaluate the accuracy of the classifier.
We need some metrics to evaluate how well our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.
The confusion matrix is simply a 2D array. The entry C[i][j] of the confusion matrix is equal to the number of observations predicted to have facies j, but known to have facies i.
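A minimal sketch of that definition, using made-up labels for three classes, is shown below. The helper `toy_confusion_matrix` is hypothetical and written only for illustration; the notebook itself uses scikit-learn's `confusion_matrix`.

```python
import numpy as np

def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Count observations with true class i (row) and predicted class j (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels: classes are numbered 0, 1, 2 here.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

C = toy_confusion_matrix(y_true, y_pred, n_classes=3)
print(C)
```

Here `C[0][1]` is 1 because one observation with true class 0 was predicted as class 1.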
To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.
In [20]:
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from classification_utilities import display_cm, display_adj_cm
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
The rows of the confusion matrix correspond to the actual facies labels. The columns correspond to the labels assigned by the classifier. For example, consider the first row. For the feature vectors in the test set that actually have label SS, 23 were correctly identified as SS, 21 were classified as CSiS and 2 were classified as FSiS.
The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.
In [21]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc
As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other. For facies label i, adjacent_facies[i] is an array of the adjacent facies labels.
In [22]:
# dtype=object is needed because the rows have different lengths
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]], dtype=object)

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))
In [23]:
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))
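To see how the two scores differ, here is a small worked example on a made-up 3-class confusion matrix. The matrix `conf_toy` and the adjacency list `adjacent_toy` are invented for illustration, and the calculation is a vectorised equivalent of the loops above rather than the notebook's own functions.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions.
conf_toy = np.array([[5, 2, 1],
                     [1, 6, 1],
                     [0, 3, 2]])
# Made-up adjacency: class 0 neighbours 1, class 1 neighbours 0 and 2, class 2 neighbours 1.
adjacent_toy = [[1], [0, 2], [1]]

# Exact accuracy: diagonal entries over all entries.
exact = np.trace(conf_toy) / conf_toy.sum()

# Adjacent accuracy: also count predictions that landed in a neighbouring class.
adjacent_hits = sum(conf_toy[i, j]
                    for i, neighbours in enumerate(adjacent_toy)
                    for j in neighbours)
within_adjacent = (np.trace(conf_toy) + adjacent_hits) / conf_toy.sum()

print(exact, within_adjacent)
```

Only the single class 0 example predicted as class 2 counts as an error under the adjacent measure, because classes 0 and 2 are not neighbours in this toy setup.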
The classifier so far has been built with the default parameters. However, we may be able to get improved classification results with optimal parameter choices.
We will consider two parameters. The parameter C is a regularization factor, and tells the classifier how much we want to avoid misclassifying training examples. A large value of C will try to correctly classify more examples from the training set, but if C is too large it may 'overfit' the data and fail to generalize when classifying new data. If C is too small then the model will not be good at fitting outliers and will have a large error on the training set.
The SVM learning algorithm uses a kernel function to measure how similar two feature vectors are, based on the distance between them. Many kernel functions exist, but in this case we are using the radial basis function rbf kernel (the default). The gamma parameter controls the width of the radial basis functions, which determines how far apart two vectors in the feature space can be and still be considered close. A larger gamma gives a narrower function.
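As a rough illustration of the role of gamma, the sketch below evaluates the standard RBF kernel formula, exp(-gamma * ||x - y||^2), for one pair of made-up points. The helper name `rbf_kernel_value` is hypothetical; it is not part of scikit-learn.

```python
import numpy as np

def rbf_kernel_value(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a = [0.0, 0.0]
b = [1.0, 1.0]  # squared distance from a is 2

# Larger gamma makes the same pair of points look less similar.
for gamma in (0.01, 1.0, 10.0):
    print(gamma, rbf_kernel_value(a, b, gamma))
```

With a small gamma even distant points are treated as similar, which tends to give smoother decision boundaries; with a large gamma only very close points influence each other.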
We will train a series of classifiers with different values for C and gamma. Two nested loops are used to train a classifier for every possible combination of values in the ranges specified. The classification accuracy is recorded for each combination of parameter values. The results are shown in a series of plots, so the parameter values that give the best classification accuracy on the test set can be selected.
This process is also known as 'cross validation'. Often a separate 'cross validation' dataset will be created in addition to the training and test sets to do model selection. For this tutorial we will just use the train set to choose model parameters within GridSearchCV.
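The splitting that happens inside a cross-validated search can be sketched as follows. This is a simplified, shuffled k-fold written with NumPy only; the helper `kfold_indices` is hypothetical, and scikit-learn's own splitters differ in detail (for classifiers, GridSearchCV uses a stratified split by default).

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, validation_idx) pairs so every sample is validated once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print("train on", len(train_idx), "samples, validate on", sorted(val_idx.tolist()))
```

Each candidate parameter combination is scored on every held-out fold and the scores are averaged, which is what the mean test score reported by the search represents.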
In [24]:
from sklearn.model_selection import GridSearchCV
parameters = {'C': [.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000],
'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10],
'kernel': ['rbf']} # This could be extended to the linear kernel but it takes a long time
svr = SVC()
clf = GridSearchCV(svr, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)
Out[24]:
Let's plot the GridSearchCV results in a heatmap.
In [25]:
cv_results = {"C": clf.cv_results_["param_C"], "gamma": clf.cv_results_["param_gamma"],
"Score": clf.cv_results_['mean_test_score']}
cv_results = pd.DataFrame(cv_results)
cv_results = cv_results[cv_results.columns].astype(float)
# keyword arguments are required by pivot in recent pandas versions
cv_results = cv_results.pivot(index="C", columns="gamma", values="Score")
plt.figure(figsize=(10, 8))
sns.heatmap(cv_results, annot=True, square=True, cmap="YlGnBu", fmt='.3g')
plt.title('F1 Score');
C = 1000 and gamma = 1 seem to give the best F1 score. Let's try using these against the test dataset.
In [26]:
clf_svm = SVC(C=1000, gamma=1)
clf_svm.fit(X_train, y_train)
predicted_labels = clf_svm.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [27]:
from sklearn import tree
parameters = {'max_depth': np.arange(2, 35)}
clf_dt = tree.DecisionTreeClassifier()
clf = GridSearchCV(clf_dt, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)
Out[27]:
In [28]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()
Out[28]:
In [29]:
clf_dt = tree.DecisionTreeClassifier(max_depth=32)
clf_dt.fit(X_train, y_train)
predicted_labels = clf_dt.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [30]:
import xgboost as xgb
clf_xgb = xgb.XGBClassifier()
clf_xgb.fit(X_train, y_train)
predicted_labels = clf_xgb.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [31]:
from sklearn.naive_bayes import GaussianNB
clf_nb = GaussianNB()
clf_nb.fit(X_train, y_train)
predicted_labels = clf_nb.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [32]:
from sklearn.ensemble import AdaBoostClassifier
clf_ab = AdaBoostClassifier(n_estimators=100)
clf_ab.fit(X_train, y_train)
predicted_labels = clf_ab.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [33]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf_rf.fit(X_train, y_train)
predicted_labels = clf_rf.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [34]:
from sklearn import neighbors
parameters = {'n_neighbors': np.arange(1, 25), 'weights': ['uniform', 'distance']}
knn = neighbors.KNeighborsClassifier()
clf = GridSearchCV(knn, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)
Out[34]:
In [35]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head(10)
Out[35]:
It looks like several different parameter combinations give the same score. We'll pick the one using the fewest neighbours: n_neighbors=1 and weights="distance" (the latter has no effect with k=1).
In [36]:
clf_knn = neighbors.KNeighborsClassifier(1, weights="distance")
clf_knn.fit(X_train, y_train)
predicted_labels = clf_knn.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [37]:
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers
feature_columns = learn.infer_real_valued_columns_from_input(X_train)
clf_dnn = learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[200, 400, 200], n_classes=10)
clf_dnn.fit(X_train, y_train, steps=5000)
predicted_labels = list(clf_dnn.predict(X_test, as_iterable=True))
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
The best classifiers so far are SVM, DecisionTree, KNN, RandomForest and DNN. Let's put all 5 together and take a majority vote to decide the final classification. Due to the way TensorFlow estimators work, they are currently incompatible with the VotingClassifier in scikit-learn, so we'll do the majority vote using the modal value calculated in a pandas DataFrame.
Update: In practice, using XGBoost instead of the DNN gave slightly better F1 scores.
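The pandas-based majority vote mentioned above can be sketched on toy predictions. The three columns and their values are made up; in the notebook each column would hold the predictions of one fitted classifier.

```python
import pandas as pd

# Made-up facies predictions from three classifiers for four samples.
preds = pd.DataFrame({
    "SVM": [1, 2, 3, 4],
    "KNN": [1, 2, 2, 5],
    "RandomForest": [1, 3, 3, 6],
})

# mode(axis=1) returns the most common value(s) in each row; column 0 is the first mode.
majority = preds.mode(axis=1)[0].astype(int)

print(majority.tolist())
```

In the last row all three classifiers disagree, so every value is a mode and the smallest label is taken. That tie-breaking is a side effect of this approach rather than a deliberate choice.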
In [38]:
from sklearn.ensemble import VotingClassifier
eclf = VotingClassifier(estimators=[('SVM', clf_svm),
('DecisionTree', clf_dt),
('KNN', clf_knn),
('RandomForest', clf_rf),
('XGBoost', clf_xgb)
],
voting='hard')
eclf.fit(X_train, y_train)
predicted_labels = eclf.predict(X_test)
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [39]:
display_adj_cm(conf, facies_labels, adjacent_facies, display_metrics=True, hide_zeros=True)
Alternatively, use a homemade majority vote that also works with TensorFlow estimators (commented out below).
In [40]:
# classifiers = {
#     "SVM": SVC(C=1000, gamma=1),
#     "DecisionTree": tree.DecisionTreeClassifier(max_depth=32),
#     "KNN": neighbors.KNeighborsClassifier(1, weights="distance"),
# #     "DNN": learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[200, 400, 200], n_classes=10),
#     "RandomForest": RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0),
#     "XGBoost": xgb.XGBClassifier()
# }

# def fit_and_predict(X_train, y_train, X_test, classifiers):
#     predicted_values = {}
#     for key, classifier in classifiers.items():
#         if key == "DNN":
#             classifier.fit(X_train, y_train, steps=5000)
#             list(classifier.predict(X_test, as_iterable=True))
#         else:
#             classifier.fit(X_train, y_train)
#             predicted_values[key] = classifier.predict(X_test)
#     return pd.DataFrame(predicted_values)
In [41]:
# predicted_values = fit_and_predict(X_train, y_train, X_test, classifiers)
In [42]:
# majority_vote = predicted_values.mode(axis=1)[0]
# conf = confusion_matrix(y_test, majority_vote.fillna(2))
# display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [43]:
# display_adj_cm(conf, facies_labels, adjacent_facies, display_metrics=True, hide_zeros=True)
In [53]:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score, classification_report

f1_eclf = []
logo = LeaveOneGroupOut()

for train, test in logo.split(scaled_features, correct_facies_labels, groups=well_names):
    eclf.fit(scaled_features[train], correct_facies_labels[train])
    pred = eclf.predict(scaled_features[test])
    sc = f1_score(correct_facies_labels[test], pred, labels=np.arange(10), average='micro')
    # test holds positional indices, so look the well name up by position
    well_name = well_names.iloc[test[0]]
    print("{} {:.3f}".format(well_name, sc))
    f1_eclf.append(sc)
    # conf = confusion_matrix(correct_facies_labels[test], pred)
    # display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
    # print("")

print("Average leave-one-well-out F1 Score: %6f" % (sum(f1_eclf)/(1.0*(len(f1_eclf)))))
Issue: I'm not sure why the leave-one-well-out scores are so much worse than those from the randomised train/test split (most likely variation between wells due to a combination of geology and well log quality). More work is needed in this area.
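For reference, the way a leave-one-well-out split differs from a random split can be sketched with a few made-up well names. The helper `leave_one_group_out` is a hypothetical NumPy-only stand-in for scikit-learn's LeaveOneGroupOut, written for illustration.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out one whole group at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Made-up well name for each of six samples.
wells = ["A", "A", "B", "B", "B", "C"]

folds = list(leave_one_group_out(wells))
for name, train_idx, test_idx in folds:
    print(name, "train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Every sample from the held-out well is excluded from training, so the model never sees depth neighbours of the samples it is scored on. A random split gives no such guarantee.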
In [45]:
#Load testing data and standardise
test_data = pd.read_csv('../validation_data_nofacies.csv')
test_features = test_data.drop(['Well Name', 'Depth', "Formation"], axis=1)
scaled_test_features = scaler.transform(test_features)
In [46]:
eclf.fit(scaled_features, correct_facies_labels)
predicted_test_labels = eclf.predict(scaled_test_features)
# Save predicted labels
test_data['Facies'] = predicted_test_labels
test_data.to_csv('Anjum48_Prediction_Submission.csv')
In [47]:
# predicted_test_labels = fit_and_predict(scaled_features, correct_facies_labels,
# scaled_test_features, classifiers)
In [48]:
# # Save predicted labels
# test_data['Facies'] = predicted_test_labels.mode(axis=1)[0]
# test_data.to_csv('Anjum48_Prediction_Submission.csv')
In [49]:
# Plot predicted labels
make_facies_log_plot(
test_data[test_data['Well Name'] == 'STUART'],
facies_colors=facies_colors)
make_facies_log_plot(
test_data[test_data['Well Name'] == 'CRAWFORD'],
facies_colors=facies_colors)
mpl.rcParams.update(inline_rc)
Interestingly, in the test wells there appears to be some bad data where the logs look linearly interpolated, e.g. at ~3025mMD in Crawford.
VotingClassifier
The leave-one-group-out analysis showed a massive reduction in the F1 score. More work is needed to understand why this is. Using LOGO-type CV would probably be more appropriate for training models.